Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave and the reasons why, and improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
In this section, we consolidate all the libraries that will be used: data analysis (e.g., pandas, numpy), data visualization (matplotlib, seaborn, plotly), and modeling (e.g., scikit-learn, XGBoost).
For this project, we're required to develop 5 models. The classification models that I've chosen are BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, XGBoost, and Stacking.
import warnings
warnings.filterwarnings("ignore")
import os
# data and analysis libraries
import pandas as pd
import numpy as np
from sklearn import metrics
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import scipy.stats as stats
from scipy.stats import uniform, randint
#import data visualization libraries
import plotly.express as px
from scipy.stats import skew
import plotly.graph_objects as go
#model libraries
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict, StratifiedKFold
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
fbeta_score,
make_scorer
)
# data preprocessing libraries
# To be used for data scaling and encoding
from sklearn.preprocessing import (
StandardScaler,
MinMaxScaler,
OneHotEncoder,
RobustScaler,
)
# 5 models to be used: BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, XGBoost, Stacking
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
#hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
import optuna
# oversampling and undersampling data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
#data treatment library
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder
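# ability to import from a GitHub repository
import certifi
import ssl
# NOTE: the next line disables TLS certificate verification for every HTTPS request in this session;
# it works around local certificate errors and should not be used outside a sandbox
ssl._create_default_https_context = ssl._create_unverified_context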
Loading the dataset. The project explored several kinds of data sources, including a SQL database and GitHub; here the CSV file is read directly from a GitHub repository.
url = "https://raw.githubusercontent.com/wesliejh/Project-3---Credit-Card-Churn/main/BankChurners.csv"
bank_data = pd.read_csv(url)
data = bank_data.copy()
data.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
data.shape
(10127, 21)
Observations
Columns of object dtype include Gender, Education_Level, Marital_Status, Income_Category, and Card_Category.
Education_Level and Marital_Status have missing values.
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.0 | 7.391776e+08 | 3.690378e+07 | 708082083.0 | 7.130368e+08 | 7.179264e+08 | 7.731435e+08 | 8.283431e+08 |
| Customer_Age | 10127.0 | 4.632596e+01 | 8.016814e+00 | 26.0 | 4.100000e+01 | 4.600000e+01 | 5.200000e+01 | 7.300000e+01 |
| Dependent_count | 10127.0 | 2.346203e+00 | 1.298908e+00 | 0.0 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 |
| Months_on_book | 10127.0 | 3.592841e+01 | 7.986416e+00 | 13.0 | 3.100000e+01 | 3.600000e+01 | 4.000000e+01 | 5.600000e+01 |
| Total_Relationship_Count | 10127.0 | 3.812580e+00 | 1.554408e+00 | 1.0 | 3.000000e+00 | 4.000000e+00 | 5.000000e+00 | 6.000000e+00 |
| Months_Inactive_12_mon | 10127.0 | 2.341167e+00 | 1.010622e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Contacts_Count_12_mon | 10127.0 | 2.455317e+00 | 1.106225e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Credit_Limit | 10127.0 | 8.631954e+03 | 9.088777e+03 | 1438.3 | 2.555000e+03 | 4.549000e+03 | 1.106750e+04 | 3.451600e+04 |
| Total_Revolving_Bal | 10127.0 | 1.162814e+03 | 8.149873e+02 | 0.0 | 3.590000e+02 | 1.276000e+03 | 1.784000e+03 | 2.517000e+03 |
| Avg_Open_To_Buy | 10127.0 | 7.469140e+03 | 9.090685e+03 | 3.0 | 1.324500e+03 | 3.474000e+03 | 9.859000e+03 | 3.451600e+04 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 7.599407e-01 | 2.192068e-01 | 0.0 | 6.310000e-01 | 7.360000e-01 | 8.590000e-01 | 3.397000e+00 |
| Total_Trans_Amt | 10127.0 | 4.404086e+03 | 3.397129e+03 | 510.0 | 2.155500e+03 | 3.899000e+03 | 4.741000e+03 | 1.848400e+04 |
| Total_Trans_Ct | 10127.0 | 6.485869e+01 | 2.347257e+01 | 10.0 | 4.500000e+01 | 6.700000e+01 | 8.100000e+01 | 1.390000e+02 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 7.122224e-01 | 2.380861e-01 | 0.0 | 5.820000e-01 | 7.020000e-01 | 8.180000e-01 | 3.714000e+00 |
| Avg_Utilization_Ratio | 10127.0 | 2.748936e-01 | 2.756915e-01 | 0.0 | 2.300000e-02 | 1.760000e-01 | 5.030000e-01 | 9.990000e-01 |
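Questions:
How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
# function to plot a boxplot and a histogram along the same scale.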
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # two rows in the subplot grid
        sharex=True,  # x-axis is shared by both subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    # boxplot on top; a triangle marks the mean of the column
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")
    # histogram below; fall back to seaborn's automatic binning when bins is not given
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )
    # mean (green, dashed) and median (black, solid) on the histogram
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with the count or percentage at the top of each bar
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # number of rows in the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            # percentage of each class of the category
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # place the label just above the bar
    plt.show()  # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # a single legend outside the axes (the earlier "lower left" call was overwritten by this one)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
### Function to plot distributions
def distribution_plot(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    # histogram of the predictor for the first target class
    axs[0, 0].set_title("Distribution of " + predictor + " for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )
    # histogram of the predictor for the second target class
    axs[0, 1].set_title("Distribution of " + predictor + " for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
data.duplicated().sum()
0
data.isnull().sum()/len(data)*100
CLIENTNUM 0.000000
Attrition_Flag 0.000000
Customer_Age 0.000000
Gender 0.000000
Dependent_count 0.000000
Education_Level 14.999506
Marital_Status 7.396070
Income_Category 0.000000
Card_Category 0.000000
Months_on_book 0.000000
Total_Relationship_Count 0.000000
Months_Inactive_12_mon 0.000000
Contacts_Count_12_mon 0.000000
Credit_Limit 0.000000
Total_Revolving_Bal 0.000000
Avg_Open_To_Buy 0.000000
Total_Amt_Chng_Q4_Q1 0.000000
Total_Trans_Amt 0.000000
Total_Trans_Ct 0.000000
Total_Ct_Chng_Q4_Q1 0.000000
Avg_Utilization_Ratio 0.000000
dtype: float64
Education_Level and Marital_Status have 15% and 7% missing, respectively.
# create variable for only number dtypes
data_int = data.select_dtypes(include='number')
#create variable for all other dtypes
data_cat = data.select_dtypes(exclude='number')
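The missing-value percentages and the numeric/categorical split above can also be expressed with `isna().mean()` and column-name lists, which tend to be convenient later when columns are treated per dtype. A minimal sketch on a small made-up frame; the `toy` values and the names `missing_pct`, `num_cols`, and `cat_cols` are illustrative and not taken from BankChurners.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the mix of dtypes in the bank data.
toy = pd.DataFrame(
    {
        "Customer_Age": [45, 49, 51, 40],
        "Education_Level": ["High School", "Graduate", np.nan, "High School"],
        "Marital_Status": ["Married", np.nan, "Married", np.nan],
    }
)

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean() * 100

# Column names by dtype family, mirroring data_int / data_cat.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_pct[missing_pct > 0].sort_values(ascending=False))
print(num_cols, cat_cols)
```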
data.plot(kind = 'box', subplots=True, figsize = (10,40), layout = (19,2), sharex=False, sharey = False)
plt.show();
Observations
data_int.hist(layout = (13,3), figsize=(8, 30))
plt.show();
Attrition_Flag
data['Attrition_Flag'].value_counts()
Attrition_Flag
Existing Customer 8500
Attrited Customer 1627
Name: count, dtype: int64
data['Attrition_Flag'].value_counts()/len(data)*100
Attrition_Flag
Existing Customer 83.934038
Attrited Customer 16.065962
Name: count, dtype: float64
labeled_barplot(data, 'Attrition_Flag')
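Most scikit-learn metrics expect a numeric target, so the flag is usually mapped to 0/1 before modeling. A minimal sketch, assuming the attrited class is treated as the positive one; the mapping and the names `flags`, `target`, and `churn_rate` are illustrative choices rather than something fixed by the notebook. The series is rebuilt from the counts printed above:

```python
import pandas as pd

# Rebuild the label column from the counts shown above (8500 / 1627).
flags = pd.Series(
    ["Existing Customer"] * 8500 + ["Attrited Customer"] * 1627,
    name="Attrition_Flag",
)

# Assumption: churned customers are the positive class (1).
target = flags.map({"Existing Customer": 0, "Attrited Customer": 1})

# The mean of a 0/1 target is the share of positives.
churn_rate = target.mean()
print(f"Churn rate: {churn_rate:.4f}")
```

With roughly 16% positives, a model that labels every customer as existing would already reach about 84% accuracy, which is why recall and F-scores are among the imported metrics.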
Gender
data['Gender'].value_counts()
Gender
F 5358
M 4769
Name: count, dtype: int64
data['Gender'].value_counts()/len(data)*100
Gender
F 52.908068
M 47.091932
Name: count, dtype: float64
labeled_barplot(data, 'Gender')
Education_Level
data['Education_Level'].value_counts()
Education_Level
Graduate 3128
High School 2013
Uneducated 1487
College 1013
Post-Graduate 516
Doctorate 451
Name: count, dtype: int64
data['Education_Level'].value_counts()/len(data)*100
Education_Level
Graduate 30.887726
High School 19.877555
Uneducated 14.683519
College 10.002962
Post-Graduate 5.095290
Doctorate 4.453441
Name: count, dtype: float64
labeled_barplot(data, 'Education_Level')
Marital_Status
data['Marital_Status'].value_counts()
Marital_Status
Married 4687
Single 3943
Divorced 748
Name: count, dtype: int64
data['Marital_Status'].value_counts()/len(data)*100
Marital_Status
Married 46.282216
Single 38.935519
Divorced 7.386195
Name: count, dtype: float64
data['Marital_Status'].unique()
array(['Married', 'Single', nan, 'Divorced'], dtype=object)
data['Marital_Status'].isnull().sum()/len(data)*100
7.3960699121161255
labeled_barplot(data, 'Marital_Status')
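The Marital_Status percentages above add up to about 92.6% rather than 100% because `value_counts()` skips NaN by default while `len(data)` counts every row. A minimal sketch of how `dropna=False` together with `normalize=True` makes the missing share explicit. The series is rebuilt from the printed counts, with 749 missing rows (10127 - 9378); the names `marital` and `shares` are illustrative:

```python
import numpy as np
import pandas as pd

# Series rebuilt from the printed counts; 749 rows are missing.
marital = pd.Series(
    ["Married"] * 4687 + ["Single"] * 3943 + ["Divorced"] * 748 + [np.nan] * 749,
    name="Marital_Status",
)

# NaN is kept as its own level, and shares are relative to all rows.
shares = marital.value_counts(dropna=False, normalize=True) * 100
print(shares.round(2))
```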
Income_Category
data['Income_Category'].value_counts()
Income_Category
Less than $40K 3561
$40K - $60K 1790
$80K - $120K 1535
$60K - $80K 1402
abc 1112
$120K + 727
Name: count, dtype: int64
data['Income_Category'].unique()
array(['$60K - $80K', 'Less than $40K', '$80K - $120K', '$40K - $60K',
'$120K +', 'abc'], dtype=object)
data['Income_Category'].value_counts()/len(data)*100
Income_Category
Less than $40K 35.163425
$40K - $60K 17.675521
$80K - $120K 15.157500
$60K - $80K 13.844179
abc 10.980547
$120K + 7.178829
Name: count, dtype: float64
labeled_barplot(data, 'Income_Category')
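The level `abc` does not look like an income band, so it is probably a placeholder for unknown income. That is an assumption; the notebook does not say what the value means. Under that assumption, one option is to convert the level to NaN so that it can be handled together with the other missing values. A minimal sketch on a few illustrative values (`income` and `income_clean` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Small illustrative sample of Income_Category values (not the full column).
income = pd.Series(
    ["$60K - $80K", "Less than $40K", "abc", "$120K +", "abc"],
    name="Income_Category",
)

# Assumption: "abc" encodes an unknown income band.
income_clean = income.replace("abc", np.nan)

# Number of values now treated as missing.
print(income_clean.isna().sum())
```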
Card_Category
data['Card_Category'].value_counts()
Card_Category
Blue 9436
Silver 555
Gold 116
Platinum 20
Name: count, dtype: int64
# creating function for a quick overview of data around the quartile information
def outlier_review(data, column):
    """
    Function to review outliers in a column (1.5 * IQR rule)
    data: dataframe
    column: column name
    """
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # values outside the two bounds
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)][column]
    num_above = data[data[column] > upper_bound].shape[0]
    num_below = data[data[column] < lower_bound].shape[0]
    print("The number of outliers in " + column + " is " + str(outliers.count()))
    print()
    print("The upperbound value is: ", upper_bound)
    print("The lowerbound value is: ", lower_bound)
    print()
    print("The number of points above the upper bound is " + str(num_above))
    print("The number of points below the lower bound is " + str(num_below))
    print()
    print("Quick overview of outliers:\n", outliers, sep="")
# creating a function for a quick data review of a specific column
def data_review(data, column):
    """
    Function to review data in a column
    data: dataframe
    column: column name
    """
    print("The number of missing values is: ", data[column].isnull().sum())
    print()
    print("The number of unique values is: ", data[column].nunique())
    print()
    print("The data type is: ", data[column].dtype)
    print()
    print("The data description: \n", data[column].describe().T, sep="")
    print()
    print("The percentage of data points amongst the column is:\n", data[column].value_counts()/len(data)*100)
# run the plot, the outlier review, the data review, and the skewness for one column
def review(data, column):
    histogram_boxplot(data, column)
    outlier_review(data, column)
    data_review(data, column)
    skewness = skew(data[column])
    print(f"Skewness of {column}: {skewness}")
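For reading the skewness figures printed by review: with default arguments, `scipy.stats.skew` returns the biased Fisher-Pearson coefficient, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. A minimal NumPy sketch of that formula; `sample_skewness` is a hypothetical helper name and the two small samples are made up:

```python
import numpy as np

def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

print(sample_skewness(symmetric))     # zero for a symmetric sample
print(sample_skewness(right_tailed))  # positive: long right tail
```

Values near zero, as for Customer_Age, indicate a roughly symmetric distribution, while clearly positive values, as for Credit_Limit or Total_Trans_Amt, indicate a long right tail.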
CLIENTNUM
data['CLIENTNUM'].nunique()
10127
CLIENTNUM has the same number of unique values as the number of rows in data, so it is a unique identifier.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
CLIENTNUM is meant to be dropped; note that the data.info() output above still lists it among the 21 columns, so the drop is not reflected at this point.
Customer_Age
review(data, 'Customer_Age')
The number of outliers in Customer_Age is 2
The upperbound value is: 68.5
The lowerbound value is: 24.5
The number of points above the upper bound is 2
The number of points below the lower bound is 0
Quick overview of outliers:
251 73
254 70
Name: Customer_Age, dtype: int64
The number of missing values is: 0
The number of unique values is: 45
The data type is: int64
The data description:
count 10127.000000
mean 46.325960
std 8.016814
min 26.000000
25% 41.000000
50% 46.000000
75% 52.000000
max 73.000000
Name: Customer_Age, dtype: float64
The percentage of data points amongst the column is:
Customer_Age
44 4.937296
49 4.887923
46 4.838550
45 4.799052
47 4.729930
43 4.670682
48 4.660808
50 4.463316
42 4.206576
51 3.930088
53 3.821467
41 3.742471
52 3.712847
40 3.564728
39 3.288239
54 3.031500
38 2.992002
55 2.755011
56 2.587143
37 2.567394
57 2.202034
36 2.182285
35 1.816925
59 1.550311
58 1.550311
34 1.441691
33 1.254073
60 1.254073
32 1.046707
65 0.997334
61 0.918337
62 0.918337
31 0.898588
26 0.770218
30 0.691221
63 0.641849
29 0.552977
64 0.424607
27 0.315987
28 0.286363
67 0.039498
66 0.019749
68 0.019749
70 0.009875
73 0.009875
Name: count, dtype: float64
Skewness of Customer_Age: -0.03360003857464426
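The bounds reported for Customer_Age follow directly from the quartiles in the description: Q1 = 41 and Q3 = 52 give IQR = 11, so the fences are 41 - 1.5 × 11 = 24.5 and 52 + 1.5 × 11 = 68.5. The same arithmetic as a small sketch; `iqr_bounds` is a hypothetical helper, not part of the notebook:

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(41, 52)
print(lower, upper)  # 24.5 68.5

# The two ages flagged above (73 and 70) lie beyond the upper fence.
flagged = [age for age in (26, 68, 70, 73) if age < lower or age > upper]
print(flagged)  # [70, 73]
```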
Dependent_count
review(data, 'Dependent_count')
The number of outliers in Dependent_count is 0
The upperbound value is: 6.0
The lowerbound value is: -2.0
The number of points above the upper bound is 0
The number of points below the lower bound is 0
Quick overview of outliers:
Series([], Name: Dependent_count, dtype: int64)
The number of missing values is: 0
The number of unique values is: 6
The data type is: int64
The data description:
count 10127.000000
mean 2.346203
std 1.298908
min 0.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 5.000000
Name: Dependent_count, dtype: float64
The percentage of data points amongst the column is:
Dependent_count
3 26.977387
2 26.217044
1 18.149501
4 15.542609
0 8.926632
5 4.186827
Name: count, dtype: float64
Skewness of Dependent_count: -0.02082245083419453
Months_on_book
review(data, 'Months_on_book')
The number of outliers in Months_on_book is 386
The upperbound value is: 53.5
The lowerbound value is: 17.5
The number of points above the upper bound is 198
The number of points below the lower bound is 188
Quick overview of outliers:
11 54
18 56
27 56
39 56
52 54
..
10054 15
10062 17
10069 14
10107 54
10114 15
Name: Months_on_book, Length: 386, dtype: int64
The number of missing values is: 0
The number of unique values is: 44
The data type is: int64
The data description:
count 10127.000000
mean 35.928409
std 7.986416
min 13.000000
25% 31.000000
50% 36.000000
75% 40.000000
max 56.000000
Name: Months_on_book, dtype: float64
The percentage of data points amongst the column is:
Months_on_book
36 24.321122
37 3.535104
34 3.485731
38 3.426484
39 3.367236
40 3.288239
31 3.140120
35 3.130246
33 3.011751
30 2.962378
41 2.932754
32 2.853757
28 2.715513
43 2.695764
42 2.676015
29 2.379777
44 2.271156
45 2.241533
27 2.034166
46 1.945295
26 1.836674
47 1.688555
25 1.629308
48 1.599684
24 1.579935
49 1.392318
23 1.145453
22 1.036832
56 1.017083
50 0.947961
21 0.819591
51 0.789967
53 0.770218
20 0.730720
13 0.691221
19 0.622099
52 0.612225
18 0.572726
54 0.523353
55 0.414733
17 0.385109
15 0.335736
16 0.286363
14 0.157993
Name: count, dtype: float64
Skewness of Months_on_book: -0.1065495749017217
Total_Relationship_Count
review(data, 'Total_Relationship_Count')
The number of outliers in Total_Relationship_Count is 0
The upperbound value is: 8.0
The lowerbound value is: 0.0
The number of points above the upper bound is 0
The number of points below the lower bound is 0
Quick overview of outliers:
Series([], Name: Total_Relationship_Count, dtype: int64)
The number of missing values is: 0
The number of unique values is: 6
The data type is: int64
The data description:
count 10127.000000
mean 3.812580
std 1.554408
min 1.000000
25% 3.000000
50% 4.000000
75% 5.000000
max 6.000000
Name: Total_Relationship_Count, dtype: float64
The percentage of data points amongst the column is:
Total_Relationship_Count
3 22.760936
4 18.880221
5 18.672855
6 18.425990
2 12.274119
1 8.985879
Name: count, dtype: float64
Skewness of Total_Relationship_Count: -0.16242835172024658
Months_Inactive_12_mon
review(data, 'Months_Inactive_12_mon')
The number of outliers in Months_Inactive_12_mon is 331
The upperbound value is: 4.5
The lowerbound value is: 0.5
The number of points above the upper bound is 302
The number of points below the lower bound is 29
Quick overview of outliers:
12 6
29 0
31 5
108 0
118 6
..
9964 5
10028 5
10035 6
10049 5
10066 6
Name: Months_Inactive_12_mon, Length: 331, dtype: int64
The number of missing values is: 0
The number of unique values is: 7
The data type is: int64
The data description:
count 10127.000000
mean 2.341167
std 1.010622
min 0.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 6.000000
Name: Months_Inactive_12_mon, dtype: float64
The percentage of data points amongst the column is:
Months_Inactive_12_mon
3 37.977683
2 32.408413
1 22.049965
4 4.295448
5 1.757677
6 1.224449
0 0.286363
Name: count, dtype: float64
Skewness of Months_Inactive_12_mon: 0.6329673568012449
Contacts_Count_12_mon
review(data, 'Contacts_Count_12_mon')
The number of outliers in Contacts_Count_12_mon is 629
The upperbound value is: 4.5
The lowerbound value is: 0.5
The number of points above the upper bound is 230
The number of points below the lower bound is 399
Quick overview of outliers:
2 0
4 0
8 0
12 0
20 0
..
10101 5
10106 5
10109 5
10114 5
10120 0
Name: Contacts_Count_12_mon, Length: 629, dtype: int64
The number of missing values is: 0
The number of unique values is: 7
The data type is: int64
The data description:
count 10127.000000
mean 2.455317
std 1.106225
min 0.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 6.000000
Name: Contacts_Count_12_mon, dtype: float64
The percentage of data points amongst the column is:
Contacts_Count_12_mon
3 33.376123
2 31.865311
1 14.802014
4 13.745433
0 3.939962
5 1.737928
6 0.533228
Name: count, dtype: float64
Skewness of Contacts_Count_12_mon: 0.011003996010760743
Credit_Limit
review(data, 'Credit_Limit')
The number of outliers in Credit_Limit is 984
The upperbound value is: 23836.25
The lowerbound value is: -10213.75
The number of points above the upper bound is 984
The number of points below the lower bound is 0
Quick overview of outliers:
6 34516.0
7 29081.0
16 30367.0
40 32426.0
45 34516.0
...
10098 34516.0
10100 29808.0
10104 29663.0
10110 34516.0
10112 34516.0
Name: Credit_Limit, Length: 984, dtype: float64
The number of missing values is: 0
The number of unique values is: 6205
The data type is: float64
The data description:
count 10127.000000
mean 8631.953698
std 9088.776650
min 1438.300000
25% 2555.000000
50% 4549.000000
75% 11067.500000
max 34516.000000
Name: Credit_Limit, dtype: float64
The percentage of data points amongst the column is:
Credit_Limit
34516.0 5.016293
1438.3 5.006418
9959.0 0.177743
15987.0 0.177743
23981.0 0.118495
...
9183.0 0.009875
29923.0 0.009875
9551.0 0.009875
11558.0 0.009875
10388.0 0.009875
Name: count, Length: 6205, dtype: float64
Skewness of Credit_Limit: 1.6664789242587705
Total_Revolving_Bal
review(data, 'Total_Revolving_Bal')
The number of outliers in Total_Revolving_Bal is 0
The upperbound value is: 3921.5
The lowerbound value is: -1778.5
The number of points above the upper bound is 0
The number of points below the lower bound is 0
Quick overview of outliers:
Series([], Name: Total_Revolving_Bal, dtype: int64)
The number of missing values is: 0
The number of unique values is: 1974
The data type is: int64
The data description:
count 10127.000000
mean 1162.814061
std 814.987335
min 0.000000
25% 359.000000
50% 1276.000000
75% 1784.000000
max 2517.000000
Name: Total_Revolving_Bal, dtype: float64
The percentage of data points amongst the column is:
Total_Revolving_Bal
0 24.390244
2517 5.016293
1965 0.118495
1480 0.118495
1434 0.108621
...
2467 0.009875
2131 0.009875
2400 0.009875
2144 0.009875
2241 0.009875
Name: count, Length: 1974, dtype: float64
Skewness of Total_Revolving_Bal: -0.14881520376464566
Avg_Open_To_Buy
review(data, 'Avg_Open_To_Buy')
The number of outliers in Avg_Open_To_Buy is 963
The upperbound value is: 22660.75
The lowerbound value is: -11477.25
The number of points above the upper bound is 963
The number of points below the lower bound is 0
Quick overview of outliers:
6 32252.0
7 27685.0
16 28005.0
40 31848.0
45 34516.0
...
10100 29808.0
10103 22754.0
10104 27920.0
10110 33425.0
10112 34516.0
Name: Avg_Open_To_Buy, Length: 963, dtype: float64
The number of missing values is: 0
The number of unique values is: 6813
The data type is: float64
The data description:
count 10127.000000
mean 7469.139637
std 9090.685324
min 3.000000
25% 1324.500000
50% 3474.000000
75% 9859.000000
max 34516.000000
Name: Avg_Open_To_Buy, dtype: float64
The percentage of data points amongst the column is:
Avg_Open_To_Buy
1438.3 3.199368
34516.0 0.967710
31999.0 0.256739
787.0 0.078997
701.0 0.069122
...
6543.0 0.009875
2808.0 0.009875
21549.0 0.009875
6189.0 0.009875
8427.0 0.009875
Name: count, Length: 6813, dtype: float64
Skewness of Avg_Open_To_Buy: 1.6614504071556497
Total_Amt_Chng_Q4_Q1
review(data, 'Total_Amt_Chng_Q4_Q1')
The number of outliers in Total_Amt_Chng_Q4_Q1 is 396
The upperbound value is: 1.201
The lowerbound value is: 0.28900000000000003
The number of points above the upper bound is 348
The number of points below the lower bound is 48
Quick overview of outliers:
0 1.335
1 1.541
2 2.594
3 1.405
4 2.175
...
9793 0.225
9808 0.202
9963 0.222
10008 0.204
10119 0.166
Name: Total_Amt_Chng_Q4_Q1, Length: 396, dtype: float64
The number of missing values is: 0
The number of unique values is: 1158
The data type is: float64
The data description:
count 10127.000000
mean 0.759941
std 0.219207
min 0.000000
25% 0.631000
50% 0.736000
75% 0.859000
max 3.397000
Name: Total_Amt_Chng_Q4_Q1, dtype: float64
The percentage of data points amongst the column is:
Total_Amt_Chng_Q4_Q1
0.791 0.355485
0.712 0.335736
0.743 0.335736
0.718 0.325862
0.735 0.325862
...
1.216 0.009875
1.645 0.009875
1.089 0.009875
2.103 0.009875
0.166 0.009875
Name: count, Length: 1158, dtype: float64
Skewness of Total_Amt_Chng_Q4_Q1: 1.7318068495622156
Total_Trans_Amt
review(data, 'Total_Trans_Amt')
The number of outliers in Total_Trans_Amt is 896
The upperbound value is: 8619.25
The lowerbound value is: -1722.75
The number of points above the upper bound is 896
The number of points below the lower bound is 0
Quick overview of outliers:
8591 8693
8650 8947
8670 8854
8708 8796
8734 8778
...
10121 14596
10122 15476
10123 8764
10124 10291
10126 10294
Name: Total_Trans_Amt, Length: 896, dtype: int64
The number of missing values is: 0
The number of unique values is: 5033
The data type is: int64
The data description:
count 10127.000000
mean 4404.086304
std 3397.129254
min 510.000000
25% 2155.500000
50% 3899.000000
75% 4741.000000
max 18484.000000
Name: Total_Trans_Amt, dtype: float64
The percentage of data points amongst the column is:
Total_Trans_Amt
4253 0.108621
4509 0.108621
4518 0.098746
2229 0.098746
4220 0.088871
...
1274 0.009875
4521 0.009875
3231 0.009875
4394 0.009875
10294 0.009875
Name: count, Length: 5033, dtype: float64
Skewness of Total_Trans_Amt: 2.0407010789778317
Total_Trans_Ct
review(data, 'Total_Trans_Ct')
The number of outliers in Total_Trans_Ct is 2
The upperbound value is: 135.0
The lowerbound value is: -9.0
The number of points above the upper bound is 2
The number of points below the lower bound is 0
Quick overview of outliers:
9324 139
9586 138
Name: Total_Trans_Ct, dtype: int64
The number of missing values is: 0
The number of unique values is: 126
The data type is: int64
The data description:
count 10127.000000
mean 64.858695
std 23.472570
min 10.000000
25% 45.000000
50% 67.000000
75% 81.000000
max 139.000000
Name: Total_Trans_Ct, dtype: float64
The percentage of data points amongst the column is:
Total_Trans_Ct
81 2.053915
71 2.004542
75 2.004542
69 1.994668
82 1.994668
...
11 0.019749
134 0.009875
139 0.009875
138 0.009875
132 0.009875
Name: count, Length: 126, dtype: float64
Skewness of Total_Trans_Ct: 0.1536503056777963
Total_Ct_Chng_Q4_Q1
review(data, 'Total_Ct_Chng_Q4_Q1')
The number of outliers in Total_Ct_Chng_Q4_Q1 is 394
The upperbound value is: 1.172
The lowerbound value is: 0.22799999999999998
The number of points above the upper bound is 298
The number of points below the lower bound is 96
Quick overview of outliers:
0 1.625
1 3.714
2 2.333
3 2.333
4 2.500
...
9388 0.176
9672 1.294
9856 1.211
9917 1.207
9977 1.684
Name: Total_Ct_Chng_Q4_Q1, Length: 394, dtype: float64
The number of missing values is: 0
The number of unique values is: 830
The data type is: float64
The data description:
count 10127.000000
mean 0.712222
std 0.238086
min 0.000000
25% 0.582000
50% 0.702000
75% 0.818000
max 3.714000
Name: Total_Ct_Chng_Q4_Q1, dtype: float64
The percentage of data points amongst the column is:
Total_Ct_Chng_Q4_Q1
0.667 1.688555
1.000 1.639182
0.500 1.589809
0.750 1.540436
0.600 1.115829
...
0.827 0.009875
0.343 0.009875
1.579 0.009875
0.125 0.009875
0.359 0.009875
Name: count, Length: 830, dtype: float64
Skewness of Total_Ct_Chng_Q4_Q1: 2.063724833411372
Avg_Utilization_Ratio
review(data, 'Avg_Utilization_Ratio')
The number of outliers in Avg_Utilization_Ratio is 0
The upperbound value is: 1.2229999999999999
The lowerbound value is: -0.697
The number of points above the upper bound is 0
The number of points below the lower bound is 0
Quick overview of outliers:
Series([], Name: Avg_Utilization_Ratio, dtype: float64)
The number of missing values is: 0
The number of unique values is: 964
The data type is: float64
The data description:
count 10127.000000
mean 0.274894
std 0.275691
min 0.000000
25% 0.023000
50% 0.176000
75% 0.503000
max 0.999000
Name: Avg_Utilization_Ratio, dtype: float64
The percentage of data points amongst the column is:
Avg_Utilization_Ratio
0.000 24.390244
0.073 0.434482
0.057 0.325862
0.048 0.315987
0.060 0.296238
...
0.927 0.009875
0.935 0.009875
0.954 0.009875
0.385 0.009875
0.009 0.009875
Name: count, Length: 964, dtype: float64
Skewness of Avg_Utilization_Ratio: 0.7179016418496336
fig = px.imshow(data_int.corr(), text_auto=True, template='plotly_dark', color_continuous_scale=px.colors.sequential.Blues, aspect = 'auto', title = '<b>Correlation Matrix</b>')
fig.update_layout(title_x=0.5)
fig.show()
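A heatmap is often easier to act on when the strongest pairs are also listed explicitly. A minimal sketch, using the five rows shown by data.head() as a stand-in for the full table; `top_correlations` is a hypothetical helper and the 0.8 threshold is arbitrary. Those rows also suggest that Avg_Open_To_Buy equals Credit_Limit minus Total_Revolving_Bal, which would make the columns redundant, although five rows cannot confirm that for the whole dataset:

```python
import numpy as np
import pandas as pd

# The five rows displayed by data.head(), restricted to three columns.
head = pd.DataFrame(
    {
        "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
        "Total_Revolving_Bal": [777, 864, 0, 2517, 0],
        "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0, 796.0, 4716.0],
    }
)

def top_correlations(frame, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if any survive stacking) fail the comparison and are dropped.
    return pairs[pairs.abs() > threshold].sort_values(key=np.abs, ascending=False)

strong = top_correlations(head)
print(strong)

# Check the suspected identity on these five rows.
gap = head["Credit_Limit"] - head["Total_Revolving_Bal"] - head["Avg_Open_To_Buy"]
identity_holds = bool((gap.abs() < 1e-9).all())
print(identity_holds)
```

On the full data the same helper could be called as `top_correlations(data_int)`.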
sns.pairplot(data, hue = 'Attrition_Flag', diag_kind='kde', kind='scatter', palette='husl')
plt.show();